[core] Move stream reconnect logic to getReadable level#1847

Merged
VaguelySerious merged 14 commits into stable from peter/stream-control-at-getreadable-level on Jun 11, 2026

Conversation

VaguelySerious (Member) commented Apr 23, 2026

Moves stream reconnect handling out of the world-vercel adapter and up to the getReadable/core level, where chunk framing already lives — so reconnect works the same way across world adapters.

Reverts #1790 (the adapter-level control-frame approach). The reconnecting reader counts the 4-byte length-prefixed frames it has received and, on a connection error, reopens the stream from startIndex + framesConsumed. A clean end-of-stream is treated as completion (no reconnect). Object/serialized streams only — raw byte streams have no wire framing to count and are opted out (the caller owns its own reconnect strategy). Bounded by a consecutive-failure cap (reset on forward progress) plus an absolute total-reconnect backstop.
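
A minimal sketch of the frame-counting idea described above, assuming each frame is a 4-byte big-endian length followed by that many payload bytes. The byte order and the names `FrameCounter` and `resumeIndex` are illustrative assumptions, not identifiers from this PR.

```typescript
// Hypothetical helper: counts complete length-prefixed frames across reads.
// Assumes a 4-byte big-endian length prefix (the byte order is an assumption).
class FrameCounter {
  private buffer = new Uint8Array(0);
  consumedFrames = 0;

  // Feed one network read; returns how many frames completed in this read.
  push(chunk: Uint8Array): number {
    const merged = new Uint8Array(this.buffer.length + chunk.length);
    merged.set(this.buffer, 0);
    merged.set(chunk, this.buffer.length);

    let offset = 0;
    let completed = 0;
    while (merged.length - offset >= 4) {
      const view = new DataView(merged.buffer, merged.byteOffset + offset, 4);
      const payloadLength = view.getUint32(0, false);
      if (merged.length - offset - 4 < payloadLength) break; // partial frame
      offset += 4 + payloadLength;
      completed++;
    }
    this.buffer = merged.subarray(offset);
    this.consumedFrames += completed;
    return completed;
  }

  // On a connection error, buffered partial-frame bytes are dropped; the
  // server is expected to resend that frame from the resume index.
  discardPartial(): void {
    this.buffer = new Uint8Array(0);
  }
}

// Index to reopen the stream from after a connection error.
function resumeIndex(startIndex: number, consumedFrames: number): number {
  return startIndex + consumedFrames;
}
```

Counting only completed frames is what makes the resume index safe: a frame cut off mid-payload never increments the counter, so it is requested again on reconnect.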

Closes #1801
Closes #1802

After shipping this

Forward-ported to main in #2318. See the cross-PR comment for merge order — this only takes effect once paired with the coordinated server-side change that errors a timed-out stream connection instead of closing it cleanly.

changeset-bot (Bot) commented Apr 23, 2026

🦋 Changeset detected

Latest commit: 4a0258b

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 17 packages
Name Type
@workflow/world-vercel Patch
@workflow/core Patch
@workflow/cli Patch
@workflow/web Patch
@workflow/builders Patch
@workflow/next Patch
@workflow/nitro Patch
@workflow/vitest Patch
@workflow/web-shared Patch
workflow Patch
@workflow/world-testing Patch
@workflow/astro Patch
@workflow/nest Patch
@workflow/rollup Patch
@workflow/sveltekit Patch
@workflow/vite Patch
@workflow/nuxt Patch

vercel (Bot, Contributor) commented Apr 23, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| example-nextjs-workflow-turbopack | Ready | Preview, Comment | Jun 10, 2026 8:12pm |
| example-nextjs-workflow-webpack | Ready | Preview, Comment | Jun 10, 2026 8:12pm |
| example-workflow | Ready | Preview, Comment | Jun 10, 2026 8:12pm |
| workbench-astro-workflow | Ready | Preview, Comment | Jun 10, 2026 8:12pm |
| workbench-express-workflow | Ready | Preview, Comment | Jun 10, 2026 8:12pm |
| workbench-fastify-workflow | Ready | Preview, Comment | Jun 10, 2026 8:12pm |
| workbench-hono-workflow | Ready | Preview, Comment | Jun 10, 2026 8:12pm |
| workbench-nitro-workflow | Ready | Preview, Comment | Jun 10, 2026 8:12pm |
| workbench-nuxt-workflow | Ready | Preview, Comment | Jun 10, 2026 8:12pm |
| workbench-sveltekit-workflow | Ready | Preview, Comment | Jun 10, 2026 8:12pm |
| workbench-tanstack-start-workflow | Ready | Preview, Comment | Jun 10, 2026 8:12pm |
| workbench-vite-workflow | Ready | Preview, Comment | Jun 10, 2026 8:12pm |
| workflow-docs | Ready | Preview, Comment, Open in v0 | Jun 10, 2026 8:12pm |
| workflow-swc-playground | Ready | Preview, Comment | Jun 10, 2026 8:12pm |
| workflow-tarballs | Ready | Preview, Comment | Jun 10, 2026 8:12pm |
| workflow-web | Ready | Preview, Comment | Jun 10, 2026 8:12pm |

github-actions (Bot, Contributor) commented Apr 23, 2026

🧪 E2E Test Results

Some tests failed

Summary

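| Category | Passed | Failed | Skipped | Total |
| --- | --- | --- | --- | --- |
| ❌ ▲ Vercel Production | 922 | 1 | 67 | 990 |
| ✅ 💻 Local Development | 994 | 0 | 86 | 1080 |
| ✅ 📦 Local Production | 994 | 0 | 86 | 1080 |
| ✅ 🐘 Local Postgres | 994 | 0 | 86 | 1080 |
| ✅ 🪟 Windows | 90 | 0 | 0 | 90 |
| ❌ 🌍 Community Worlds | 130 | 92 | 6 | 228 |
| ✅ 📋 Other | 504 | 0 | 36 | 540 |
| Total | 4628 | 93 | 367 | 5088 |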

❌ Failed Tests

▲ Vercel Production (1 failed)

vite (1 failed):

  • sleepWithSequentialStepsWorkflow - sequential steps work with concurrent sleep (control) | wrun_01KTTRX1EEHJT9R3HNMTBPBJGD | 🔍 observability

🌍 Community Worlds (92 failed)

mongodb (14 failed):

  • hookWorkflow is not resumable via public webhook endpoint | wrun_01KTTRFJMZ6V3RDX89MZSNZ9XY
  • webhookWorkflow | wrun_01KTTRFTM23WFNE06BANHG475K
  • sleepingWorkflow | wrun_01KTTRG1FYXAJ3MN4987EH643G
  • outputStreamWorkflow no startIndex (reads all chunks)
  • outputStreamWorkflow negative startIndex (reads from end)
  • outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns correct index after stream completes
  • outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns -1 before any chunks are written
  • outputStreamWorkflow - getTailIndex and getStreamChunks getStreamChunks returns same content as reading the stream
  • outputStreamInsideStepWorkflow - getWritable() called inside step functions | wrun_01KTTRJWG7WA0WFA05FVANG4Q2
  • writableForwardedFromWorkflowWorkflow | wrun_01KTTRK9M5K8W6DVSAAK6T9NJF
  • writableForwardedFromStepWorkflow | wrun_01KTTRKDQC4ZG1ETVGC88445V9
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously | wrun_01KTTRQHQA25MTBBNY59H41CCQ
  • pages router sleepingWorkflow via pages router
  • resilient start: addTenWorkflow completes when run_created returns 500 | wrun_01KTTRZ27MQK6QBMJZMAWBE8YD

redis (10 failed):

  • hookWorkflow | wrun_01KTTRFBT7E3JV0BECDRSVJRW9
  • hookWorkflow is not resumable via public webhook endpoint | wrun_01KTTRFJMZ6V3RDX89MZSNZ9XY
  • sleepingWorkflow | wrun_01KTTRG1FYXAJ3MN4987EH643G
  • outputStreamWorkflow negative startIndex (reads from end)
  • outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns correct index after stream completes
  • outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns -1 before any chunks are written
  • outputStreamWorkflow - getTailIndex and getStreamChunks getStreamChunks returns same content as reading the stream
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously | wrun_01KTTRQHQA25MTBBNY59H41CCQ
  • pages router sleepingWorkflow via pages router
  • resilient start: addTenWorkflow completes when run_created returns 500 | wrun_01KTTRZ27MQK6QBMJZMAWBE8YD

turso (68 failed):

  • addTenWorkflow | wrun_01KTTREEZXQEZ435BXJ3NGR42J
  • addTenWorkflow | wrun_01KTTREEZXQEZ435BXJ3NGR42J
  • wellKnownAgentWorkflow (.well-known/agent) | wrun_01KTTRESEA7AE2M3DWABXYSQR9
  • should work with react rendering in step
  • promiseAllWorkflow | wrun_01KTTRENRHWX6SECVCZXWTP0TN
  • promiseRaceWorkflow | wrun_01KTTREVE71KGCSNDACKXZ08G8
  • promiseAnyWorkflow | wrun_01KTTREXHA18JAQC5WD7DDP81M
  • importedStepOnlyWorkflow | wrun_01KTTRF5VM7A2TF17HTSDFWRDM
  • readableStreamWorkflow | wrun_01KTTREZRN5B2GEQDQHVSRXKWP
  • hookWorkflow | wrun_01KTTRFBT7E3JV0BECDRSVJRW9
  • hookWorkflow is not resumable via public webhook endpoint | wrun_01KTTRFJMZ6V3RDX89MZSNZ9XY
  • webhookWorkflow | wrun_01KTTRFTM23WFNE06BANHG475K
  • sleepingWorkflow | wrun_01KTTRG1FYXAJ3MN4987EH643G
  • parallelSleepWorkflow | wrun_01KTTRGHBTBRSZP6HN856RPZDB
  • nullByteWorkflow | wrun_01KTTRGMJYQ956R3YGC81PCNXV
  • workflowAndStepMetadataWorkflow | wrun_01KTTRGPS3XBF2E0PMDV3G4XJN
  • outputStreamWorkflow no startIndex (reads all chunks)
  • outputStreamWorkflow positive startIndex (skips first chunk)
  • outputStreamWorkflow negative startIndex (reads from end)
  • outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns correct index after stream completes
  • outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns -1 before any chunks are written
  • outputStreamWorkflow - getTailIndex and getStreamChunks getStreamChunks returns same content as reading the stream
  • outputStreamInsideStepWorkflow - getWritable() called inside step functions | wrun_01KTTRJWG7WA0WFA05FVANG4Q2
  • writableForwardedFromWorkflowWorkflow | wrun_01KTTRK9M5K8W6DVSAAK6T9NJF
  • writableForwardedFromStepWorkflow | wrun_01KTTRKDQC4ZG1ETVGC88445V9
  • fetchWorkflow | wrun_01KTTRKHG3335J5GR1TEPN71Z2
  • promiseRaceStressTestWorkflow | wrun_01KTTRKTQR51GBWDHWWW595HZ1
  • error handling error propagation workflow errors nested function calls preserve message and stack trace
  • error handling error propagation workflow errors cross-file imports preserve message and stack trace
  • error handling error propagation step errors basic step error preserves message and stack trace
  • error handling error propagation step errors cross-file step error preserves message and function names in stack
  • error handling retry behavior regular Error retries until success
  • error handling retry behavior FatalError fails immediately without retries
  • error handling retry behavior RetryableError respects custom retryAfter delay
  • error handling retry behavior maxRetries=0 disables retries
  • error handling catchability FatalError can be caught and detected with FatalError.is()
  • error handling not registered WorkflowNotRegisteredError fails the run when workflow does not exist
  • error handling not registered StepNotRegisteredError fails the step but workflow can catch it
  • error handling not registered StepNotRegisteredError fails the run when not caught in workflow
  • hookCleanupTestWorkflow - hook token reuse after workflow completion | wrun_01KTTRQ5H4GSGSQ0QHYG90B9YH
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously | wrun_01KTTRQHQA25MTBBNY59H41CCQ
  • hookDisposeTestWorkflow - hook token reuse after explicit disposal while workflow still running | wrun_01KTTRR0DS6H5YS7WRGE4F9W6J
  • stepFunctionPassingWorkflow - step function references can be passed as arguments (without closure vars) | wrun_01KTTRRH9MGK6JQ6DQHAE944VT
  • stepFunctionWithClosureWorkflow - step function with closure variables passed as argument | wrun_01KTTRRSZBKXGZQY2GG1YADFY1
  • closureVariableWorkflow - nested step functions with closure variables | wrun_01KTTRRZFNE0WS4358KN339TYS
  • spawnWorkflowFromStepWorkflow - spawning a child workflow using start() inside a step | wrun_01KTTRS1FTJ2V10Y4F2PYA5653
  • health check (queue-based) - workflow and step endpoints respond to health check messages
  • health check (CLI) - workflow health command reports healthy endpoints
  • pathsAliasWorkflow - TypeScript path aliases resolve correctly | wrun_01KTTRSF7JD7RJQVGER1W9V5B3
  • Calculator.calculate - static workflow method using static step methods from another class | wrun_01KTTRSMM4T19KT3M2GT93CJNG
  • AllInOneService.processNumber - static workflow method using sibling static step methods | wrun_01KTTRSVAZF0HH71FFET205NE4
  • ChainableService.processWithThis - static step methods using this to reference the class | wrun_01KTTRT1SF7ZY0XBPGJYYN0K0D
  • thisSerializationWorkflow - step function invoked with .call() and .apply() | wrun_01KTTRT8CRJ8935ZV0ETXN7JR4
  • customSerializationWorkflow - custom class serialization with WORKFLOW_SERIALIZE/WORKFLOW_DESERIALIZE | wrun_01KTTRTF4EJBR080XK1GPHAY20
  • instanceMethodStepWorkflow - instance methods with "use step" directive | wrun_01KTTRTNWZY6HY5554ZPEH61KR
  • crossContextSerdeWorkflow - classes defined in step code are deserializable in workflow context | wrun_01KTTRV20AMJDT7VBZYQH6DJC5
  • stepFunctionAsStartArgWorkflow - step function reference passed as start() argument | wrun_01KTTRVAVSQPRY81DYQKRY0DXQ
  • cancelRun - cancelling a running workflow | wrun_01KTTRVHA4PMW4MFRQHFPJM3B3
  • cancelRun via CLI - cancelling a running workflow | wrun_01KTTRVTM64YZFC25VZKKPN5NW
  • pages router addTenWorkflow via pages router
  • pages router promiseAllWorkflow via pages router
  • pages router sleepingWorkflow via pages router
  • hookWithSleepWorkflow - hook payloads delivered correctly with concurrent sleep | wrun_01KTTRW6RN7M6MZHNPXJ96H6C6
  • sleepInLoopWorkflow - sleep inside loop with steps actually delays each iteration | wrun_01KTTRWP4J2JEJR8CTS98BAJYD
  • sleepWithSequentialStepsWorkflow - sequential steps work with concurrent sleep (control) | wrun_01KTTRX1EEHJT9R3HNMTBPBJGD
  • importMetaUrlWorkflow - import.meta.url is available in step bundles | wrun_01KTTRYXV3NN19055TB2MGD052
  • metadataFromHelperWorkflow - getWorkflowMetadata/getStepMetadata work from module-level helper (#1577) | wrun_01KTTRYZZ2VQWJ547KQY598T9T
  • resilient start: addTenWorkflow completes when run_created returns 500 | wrun_01KTTRZ27MQK6QBMJZMAWBE8YD

Details by Category

❌ ▲ Vercel Production

| App | Passed | Failed | Skipped |
| --- | --- | --- | --- |
| ✅ astro | 83 | 0 | 7 |
| ✅ example | 83 | 0 | 7 |
| ✅ express | 83 | 0 | 7 |
| ✅ fastify | 83 | 0 | 7 |
| ✅ hono | 83 | 0 | 7 |
| ✅ nextjs-turbopack | 88 | 0 | 2 |
| ✅ nextjs-webpack | 88 | 0 | 2 |
| ✅ nitro | 83 | 0 | 7 |
| ✅ nuxt | 83 | 0 | 7 |
| ✅ sveltekit | 83 | 0 | 7 |
| ❌ vite | 82 | 1 | 7 |

✅ 💻 Local Development

| App | Passed | Failed | Skipped |
| --- | --- | --- | --- |
| ✅ astro-stable | 84 | 0 | 6 |
| ✅ express-stable | 84 | 0 | 6 |
| ✅ fastify-stable | 84 | 0 | 6 |
| ✅ hono-stable | 84 | 0 | 6 |
| ✅ nextjs-turbopack-canary | 71 | 0 | 19 |
| ✅ nextjs-turbopack-stable | 90 | 0 | 0 |
| ✅ nextjs-webpack-canary | 71 | 0 | 19 |
| ✅ nextjs-webpack-stable | 90 | 0 | 0 |
| ✅ nitro-stable | 84 | 0 | 6 |
| ✅ nuxt-stable | 84 | 0 | 6 |
| ✅ sveltekit-stable | 84 | 0 | 6 |
| ✅ vite-stable | 84 | 0 | 6 |

✅ 📦 Local Production

| App | Passed | Failed | Skipped |
| --- | --- | --- | --- |
| ✅ astro-stable | 84 | 0 | 6 |
| ✅ express-stable | 84 | 0 | 6 |
| ✅ fastify-stable | 84 | 0 | 6 |
| ✅ hono-stable | 84 | 0 | 6 |
| ✅ nextjs-turbopack-canary | 71 | 0 | 19 |
| ✅ nextjs-turbopack-stable | 90 | 0 | 0 |
| ✅ nextjs-webpack-canary | 71 | 0 | 19 |
| ✅ nextjs-webpack-stable | 90 | 0 | 0 |
| ✅ nitro-stable | 84 | 0 | 6 |
| ✅ nuxt-stable | 84 | 0 | 6 |
| ✅ sveltekit-stable | 84 | 0 | 6 |
| ✅ vite-stable | 84 | 0 | 6 |

✅ 🐘 Local Postgres

| App | Passed | Failed | Skipped |
| --- | --- | --- | --- |
| ✅ astro-stable | 84 | 0 | 6 |
| ✅ express-stable | 84 | 0 | 6 |
| ✅ fastify-stable | 84 | 0 | 6 |
| ✅ hono-stable | 84 | 0 | 6 |
| ✅ nextjs-turbopack-canary | 71 | 0 | 19 |
| ✅ nextjs-turbopack-stable | 90 | 0 | 0 |
| ✅ nextjs-webpack-canary | 71 | 0 | 19 |
| ✅ nextjs-webpack-stable | 90 | 0 | 0 |
| ✅ nitro-stable | 84 | 0 | 6 |
| ✅ nuxt-stable | 84 | 0 | 6 |
| ✅ sveltekit-stable | 84 | 0 | 6 |
| ✅ vite-stable | 84 | 0 | 6 |

✅ 🪟 Windows

| App | Passed | Failed | Skipped |
| --- | --- | --- | --- |
| ✅ nextjs-turbopack | 90 | 0 | 0 |

❌ 🌍 Community Worlds

| App | Passed | Failed | Skipped |
| --- | --- | --- | --- |
| ✅ mongodb-dev | 3 | 0 | 2 |
| ❌ mongodb | 57 | 14 | 0 |
| ✅ redis-dev | 3 | 0 | 2 |
| ❌ redis | 61 | 10 | 0 |
| ✅ turso-dev | 3 | 0 | 2 |
| ❌ turso | 3 | 68 | 0 |

✅ 📋 Other

| App | Passed | Failed | Skipped |
| --- | --- | --- | --- |
| ✅ e2e-local-dev-nest-stable | 84 | 0 | 6 |
| ✅ e2e-local-dev-tanstack-start-stable | 84 | 0 | 6 |
| ✅ e2e-local-postgres-nest-stable | 84 | 0 | 6 |
| ✅ e2e-local-postgres-tanstack-start-stable | 84 | 0 | 6 |
| ✅ e2e-local-prod-nest-stable | 84 | 0 | 6 |
| ✅ e2e-local-prod-tanstack-start-stable | 84 | 0 | 6 |

📋 View full workflow run


Some E2E test jobs failed:

  • Vercel Prod: failure
  • Local Dev: success
  • Local Prod: success
  • Local Postgres: success
  • Windows: success

Check the workflow run for details.

⚠️ Community world tests failed (non-blocking):

  • Community Worlds: failure

Check the workflow run for details.

Signed-off-by: Peter Wielander <mittgfu@gmail.com>

TooTallNate (Member) left a comment

Review

The architectural shift makes sense: client-side frame counting is a cleaner abstraction than wire-level control frames, and moving it to core means it works for any world that returns a ReadableStream from readFromStream, not just world-vercel. The reconnect math, frame-counting, and partial-frame discard are all correct.

But there are two significant concerns I think need addressing before merge.

1. Byte streams lose auto-reconnect entirely

The PR explicitly opts byte streams out of reconnect:

if (value.type === 'bytes') {
  // No auto-reconnect here: raw byte streams have no wire framing
  const readable = new WorkflowServerReadableStream(value.name, value.startIndex);
  // ...
} else {
  const readable = createReconnectingFramedStream(value.name, value.startIndex);
  // ...
}

The reason given is technically correct (no wire framing → no chunk boundary detection client-side), but this is a regression vs. the reverted #1790, which handled byte streams just fine because the server sent the resume hint via control frame.

The use cases that lose auto-reconnect:

  • AI streaming responses (text/SSE) piped from getWritable()
  • Any HTTP route doing return new Response(run.getReadable()) for raw bytes
  • Any streaming workflow output that goes more than 2 minutes (the prior server-side timeout window) and uses byte type

The docs callout added by this PR points users to WorkflowChatTransport and supportsCancellation, but those address a different problem (cancellation, not reconnect). Pushing reconnect to the application layer — where every consumer has to reimplement it — is a step backward in usability.

Possible directions:

  1. Frame byte streams on the writable side too (4 bytes per chunk overhead) so createReconnectingFramedStream works for them. The user-facing surface stays raw bytes; only the wire format changes.
  2. Keep the control-frame approach for byte streams only as a hybrid — frame counting for non-byte streams, server-side hint for byte streams.
  3. Document this as an explicit limitation and update the docs callout to specifically warn about byte streams losing reconnect, not just talk about supportsCancellation (separate issue).

(1) seems best to me — it removes the asymmetry entirely and keeps the cleaner architecture.

2. The "clean EOF means done" assumption needs verification

if (result.done || !result.value) {
  // Clean EOF — stream is truly complete...
  controller.close();
  return;
}

This assumes the workflow-server signals "done" and "timeout/aborted" differently at the network level — clean done = FIN, timeout = error/reset. The deleted control-frame logic disambiguated these because both manifested as clean closes from a TCP perspective; the magic-footer frame was the disambiguator.

Without the control frame, the new code can't tell them apart. If the workflow-server's 2-minute timeout sends a clean FIN (rather than a TCP reset or stream error), this PR will appear to "complete" any stream that hits 2 minutes.

Is that assumption verified against the actual server behavior? The new test simulates max-duration as controller.error(...), which is fine for the unit test, but I'd want to see either:

  • An e2e test confirming a real long-lived stream against workflow-server triggers reconnect (not premature close)
  • A statement in the PR description / commit explaining why the server-side timeout is now an error not a clean close (was the workflow-server changed? was the timeout removed?)

The supportsCancellation callout suggests the architecture has shifted such that streams now run for the full function maxDuration rather than the old 2-minute server timeout — but if so, that's a precondition for this PR and worth calling out explicitly.
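
To make the distinction concrete, here is a hedged sketch of how a pull loop sees the two outcomes. `ChunkReader`, `scripted`, and `drain` are hypothetical stand-ins for a stream reader, not code from this PR. The only point is that a clean close resolves with `done: true` while a reset rejects, and only the rejection can trigger a reconnect.

```typescript
type ReadResult = { done: boolean; value?: Uint8Array };

interface ChunkReader {
  read(): Promise<ReadResult>;
}

// Scripted reader: yields its chunks, then either closes cleanly or throws.
function scripted(chunks: Uint8Array[], ending: 'close' | 'error'): ChunkReader {
  let i = 0;
  return {
    async read(): Promise<ReadResult> {
      if (i < chunks.length) return { done: false, value: chunks[i++] };
      if (ending === 'error') throw new Error('connection reset');
      return { done: true };
    },
  };
}

type Outcome = { kind: 'complete' | 'reconnect'; chunksRead: number };

// Reads until the stream ends and reports which path the ending took.
async function drain(reader: ChunkReader): Promise<Outcome> {
  let chunksRead = 0;
  try {
    for (;;) {
      const result = await reader.read();
      if (result.done || !result.value) return { kind: 'complete', chunksRead };
      chunksRead++;
    }
  } catch {
    return { kind: 'reconnect', chunksRead };
  }
}
```

If the server's timeout ends the response with a clean close, `drain` returns `complete` and no reconnect happens, which is the failure mode this section is worried about.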

Minor

See inline comments.

What looks good

  • Frame-counting math is correct: currentStartIndex += consumedFrames resumes at the right place, partial-frame buffer is correctly discarded, the math is symmetric for non-zero initial startIndex.
  • Negative startIndex correctly bypasses reconnect with a clear reason (can't compute absolute resume index without a tail-index lookup) — and there's a test for it.
  • AbortController plumbing in world-vercel readFromStream is the right primitive. Cancel propagation through cancel(reason) { abortController.abort(reason) } correctly tears down the fetch.
  • Test coverage for createReconnectingFramedStream is good — frames split across reads, partial frame at error, clean EOF, non-zero initial startIndex, negative startIndex bypass, cancel propagation. Six tests, all targeted.
  • Two changesets correctly scoped: @workflow/core for the new wrapper, @workflow/world-vercel for the cancel propagation.

value.name,
value.startIndex
);
if (value.type === 'bytes') {

Byte streams are intentionally opted out of auto-reconnect here. This is a behavioral regression vs. the reverted #1790, which handled byte streams via server-sent control frames.

The comment correctly identifies why this is hard (no wire framing → no chunk boundary detection client-side), but pushing reconnect to the application layer means:

  1. Every consumer of run.getReadable() for byte streams (AI text streaming, raw HTTP responses, etc.) has to implement its own reconnect logic.
  2. The docs callout added by this PR (about supportsCancellation) doesn't actually help — that's a cancellation fix, not a reconnect fix.

I think the right move is to frame byte streams on the writable side too (4 bytes per chunk overhead), so createReconnectingFramedStream can be used uniformly. The user-facing API stays raw bytes; only the wire format gets the length prefix. That removes the asymmetry and keeps the cleaner architecture this PR is trying to achieve.

* the writable buffers one frame per chunk when multi-writing). The wrapper
* counts completed frames and, on upstream error, reopens the connection
* with `startIndex = resolvedStartIndex + consumedFrames`. Partial-frame
* bytes buffered before the cut are discarded — the server will resend the

The comment says: "On serverfull backends, reconnects should only happen during transient errors. For serverless backends, we set this constant so that we cover at least 10 minutes even if the server would be limited to e.g. 1 minute per session."

10 reconnects × 1-minute-per-session = 10 minutes covered. That's tighter than the deleted constant in world-vercel (MAX_RECONNECTS = 50, ~100 minutes coverage at 2-min server timeouts). If the underlying assumption is that streams now run for full function maxDuration (which on Pro/Enterprise can exceed 10 minutes), this cap may be too low.

Worth either:

  1. Bumping the constant to match the longest realistic maxDuration (~15 min Pro), so something like 30, or
  2. Making it configurable per-call (or via the world)

console.warn("Error closing ReadableStream reader:", err)
});
reader = undefined;
}

Nit: cancel() here only cancels the active reader. There's a small race window: if cancel fires while connect() is in flight (between reader = undefined after a reconnect-triggering error and the new reader being assigned), there's nothing to cancel — the new connection completes and the loop continues reading.

A cancelled flag checked at the top of the pull loop and inside connect() would close this. Same race existed in the deleted world-vercel cancel handler, so it's not a regression — just worth tightening if you're touching this code.

let cancelled = false;
// ... in pull loop, top of for(;;):
if (cancelled) { controller.close(); return; }
// ... in cancel:
cancelled = true;
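
A slightly fuller sketch of the same guard, assuming the connect step is asynchronous. `Connection` and `CancellableConnector` are hypothetical names used only to show the flag being checked both before and after the in-flight connect.

```typescript
interface Connection {
  close(): void;
}

class CancellableConnector {
  private cancelled = false;
  private active: Connection | undefined;
  private readonly open: () => Promise<Connection>;

  constructor(open: () => Promise<Connection>) {
    this.open = open;
  }

  // Returns true if a connection was established and kept.
  async connect(): Promise<boolean> {
    if (this.cancelled) return false;
    const connection = await this.open();
    if (this.cancelled) {
      // cancel() fired while open() was in flight: tear down the new connection.
      connection.close();
      return false;
    }
    this.active = connection;
    return true;
  }

  cancel(): void {
    this.cancelled = true;
    this.active?.close();
    this.active = undefined;
  }
}
```

The second check is the one that closes the race: without it, a connection that finishes opening after `cancel()` would be kept and read from.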

const { world } = makeWorldWithScriptedStreams({
0: () =>
scriptedStream([
// Split frame into 3 byte-level reads to prove boundary-aware

Test simulates max-duration abort as controller.error(...) — which is correct for what the wrapper sees on a network reset, but doesn't verify the actual workflow-server behavior matches.

If workflow-server's stream timeout sends a clean FIN (i.e., calls controller.close() on its end) instead of an error, this code path will treat it as EOF and not reconnect. The control-frame logic that this PR removes was specifically designed to disambiguate these two cases.

Could you confirm in the PR description whether:

  1. workflow-server's stream timeout has been removed entirely (streams now run for full function maxDuration), OR
  2. the timeout still exists but now manifests as a network error / TCP reset rather than a clean FIN?

This is the load-bearing assumption of the whole design.

TooTallNate (Member)

Following up after the discussion thread — consolidating the recommended direction so it's all in one place.

Recommended direction

Move byte-stream framing into core, gated on a per-run feature flag, with the resolved choice baked into the serialized stream ref.

The PR's instinct (move reconnect to core) is right. The concrete change to make it work uniformly for byte streams:

1. Frame byte streams on the writer side

In serialization.ts, the byte-stream branch of the ReadableStream reducer currently does:

ops.push(value.pipeTo(writable));

It would become:

ops.push(
  value
    .pipeThrough(getByteFramingStream())  // wrap each chunk in [4-byte len][bytes]
    .pipeTo(writable)
);

Cost: 4 bytes per server-side chunk. For typical streaming workloads (AI text chunks of dozens of bytes, structured byte payloads in the KB+ range) this is well under 5% overhead.

2. Use createReconnectingFramedStream for both branches on the reader side

The non-byte branch already does. The byte branch additionally pipes through an unframing transform that strips the 4-byte length and emits raw bytes to a type: 'bytes' WHATWG stream — preserving the user-facing API exactly as it is today.
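
A rough sketch of what the framing and unframing steps could look like, assuming a 4-byte big-endian length prefix. `frameChunk` and `Unframer` are hypothetical names; `getByteFramingStream()` above is only a proposed helper, and its real shape is not shown in this thread.

```typescript
// Wrap one chunk as [4-byte length][bytes]. Big-endian is an assumption.
function frameChunk(chunk: Uint8Array): Uint8Array {
  const framed = new Uint8Array(4 + chunk.length);
  new DataView(framed.buffer).setUint32(0, chunk.length, false);
  framed.set(chunk, 4);
  return framed;
}

// Stateful unframer: accepts arbitrary network reads, emits whole payloads.
class Unframer {
  private pending = new Uint8Array(0);

  push(read: Uint8Array): Uint8Array[] {
    const merged = new Uint8Array(this.pending.length + read.length);
    merged.set(this.pending, 0);
    merged.set(read, this.pending.length);

    const payloads: Uint8Array[] = [];
    let offset = 0;
    while (merged.length - offset >= 4) {
      const length = new DataView(merged.buffer, offset, 4).getUint32(0, false);
      if (merged.length - offset - 4 < length) break; // wait for more bytes
      payloads.push(merged.slice(offset + 4, offset + 4 + length));
      offset += 4 + length;
    }
    this.pending = merged.slice(offset);
    return payloads;
  }
}
```

Either function could sit inside a `TransformStream`; the consumer would still receive only the raw payload bytes, with the prefix never leaving the transport layer.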

3. WHATWG type: 'bytes' semantics are unaffected

To clarify a point that came up in the discussion: WHATWG's type: 'bytes' is purely about the reader-side API (BYOB readers, Uint8Array chunks, optional autoAllocateChunkSize). The spec says nothing about wire format or chunk-boundary semantics. Whether the bytes are framed on the wire is a transport choice the SDK gets to make — it doesn't change what the user sees from getReader().

So the framing change is purely internal to serialization. User-facing API is identical.

Backwards compatibility

This is the load-bearing concern, since byte-stream wire format becomes a versioning surface.

Cross-version exposures (post-version-skew-protection)

Within a single run: no exposure. Workflow runs are pinned to one deployment, so all chunks of any stream within a run are written and read by the same SDK version.

The only real exposures are streams that cross the run boundary via hook payloads, where the producer and consumer can be different SDK versions:

  1. Newer caller → older run (resumeHook(token, { stream: writable }) where the older run writes to it): older writer can't frame, newer reader must accept raw.
  2. Newer caller → older run (resumeHook(token, { stream: readable }) where the older run reads from it): older reader can't unframe, newer writer must produce raw.
  3. Older caller → newer run: mirror cases — newer side must defer to the older side's format.

In all cases, the framing decision must be made at the producer side based on the consumer side's capability.

Proposed mechanism

  • Per-run feature flag in run.features, e.g. 'byte-stream-framing'. Set at run-creation time based on the SDK version of the run's pinned deployment.
  • NOT specVersion: that's reserved for World-protocol changes (queue transport, event schemas). Byte-stream framing is purely a core/serialization concern that worlds don't need to know about. Features are the right granularity.
  • Reducer resolves at serialization time: the ReadableStream / WritableStream reducer looks up the target run's features and decides framing. For hook payloads the target is the hook's owning run (already looked up by the resumeHook code path); for same-run streams the target is the current run.
  • Bake the resolved choice into the stream ref:
    ReadableStream:
      | { name: string; type?: 'bytes'; startIndex?: number; framing?: 'raw' | 'framed-v1' }
      | { bodyInit: any };
  • Reader dispatches on the ref field: framing === 'framed-v1' → use createReconnectingFramedStream + unframing transform; framing === undefined | 'raw' → use existing WorkflowServerReadableStream (no reconnect).
  • Default is raw: absence of the field means raw, so existing serialized refs from older SDKs still work.
  • Auto-reconnect for byte streams becomes opt-in for new runs only. Old runs keep current no-reconnect behavior. Consistent with how feature flags work elsewhere in the codebase.
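
The producer-side decision and reader-side dispatch could be sketched roughly as below. The `framing` field and the `'byte-stream-framing'` flag name come from the proposal above; `StreamRef`, `chooseFraming`, `readerStrategy`, and the shape of the features list are assumptions made for illustration.

```typescript
type Framing = 'raw' | 'framed-v1';

interface StreamRef {
  name: string;
  type?: 'bytes';
  startIndex?: number;
  framing?: Framing;
}

// Producer side: pick the wire format from the target run's features.
// Modelling features as a list of strings is an assumption.
function chooseFraming(targetRunFeatures: readonly string[]): Framing {
  return targetRunFeatures.indexOf('byte-stream-framing') !== -1 ? 'framed-v1' : 'raw';
}

// Reader side: an absent field means raw, so refs from older SDKs still work.
function readerStrategy(ref: StreamRef): 'reconnecting-framed' | 'plain' {
  return ref.framing === 'framed-v1' ? 'reconnecting-framed' : 'plain';
}
```

Because the choice is stored on the ref, the reader never has to look up run features itself; it only trusts what the producer resolved at serialization time.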

One implementation note

For start(workflow, args, { deploymentId }) with cross-deployment args, the args are serialized before the target run exists. The reducer needs a path to predict features for the target deployment without an actual run object — probably reading the deployment manifest's SDK version. Worth confirming this lookup is feasible at reducer-call time before committing to the design.

What's still open in the current PR

The architectural shift (frame counting in core, simpler world-vercel transport, AbortController plumbing) is good and should land. The two outstanding points from my prior review:

  1. Byte streams are opted out of reconnect — addressed by the above.
  2. "Clean EOF means done" assumption — still worth verifying explicitly. Either confirm that workflow-server's stream timeout now manifests as a network error (not a clean FIN), or document that this design only works if the server signals timeout-via-error.

The framing change for byte streams could be a follow-up PR if you want to keep the scope of this one tight, but the docs callout should at minimum be updated to clarify that byte streams currently lose auto-reconnect, distinct from the supportsCancellation issue (which is about cancellation, not reconnect).

TooTallNate (Member) left a comment

Approving — withdrawing my prior request-for-changes.

Context: my earlier blocker was that this PR opts byte streams out of auto-reconnect, which I called a regression vs. the now-reverted #1790. Since then we discussed it and settled on a different plan: this PR lands on stable as-is (object-stream reconnect only), and byte-stream support gets added on main/v5 via wire-level framing in a follow-up. The framing work is now in PRs #1854 (workflowCoreVersion on HealthCheckResult) and #1853 (the framing itself), which together let createReconnectingFramedStream be applied uniformly to byte streams on main once they land.

So for stable, this PR is the right scope:

  • Object-stream reconnect via createReconnectingFramedStream is correct.
  • Byte streams legitimately can't be auto-reconnected with the legacy unframed wire format that stable ships, so opting them out is the right call there.
  • Frame-counting math, AbortController plumbing, world-vercel simplification all look good.

The earlier non-blocking concerns I raised still apply — would be nice to address them but I'm not gating on them:

  1. The "clean EOF means done" assumption. Worth a sentence in the commit/PR description confirming whether workflow-server's stream timeout now manifests as a network error rather than a clean FIN, since the deleted control-frame logic was specifically there to disambiguate them.
  2. FRAMED_STREAM_MAX_RECONNECTS = 10 is tighter than the deleted MAX_RECONNECTS = 50. Probably fine, but worth a sanity check against the longest realistic Pro/Enterprise maxDuration.
  3. Cancel race during reconnect — pre-existing, not a regression here.

Signed-off-by: Peter Wielander <mittgfu@gmail.com>
Comment thread docs/content/docs/deploying/world/vercel-world.mdx Outdated
Comment thread docs/content/docs/foundations/streaming.mdx Outdated
Co-authored-by: Peter Wielander <mittgfu@gmail.com>
Signed-off-by: Peter Wielander <mittgfu@gmail.com>
VaguelySerious and others added 3 commits May 29, 2026 15:47
…l-at-getreadable-level

# Conflicts:
#	packages/world-vercel/src/streamer.test.ts
#	packages/world-vercel/src/streamer.ts
…preview

- serialization: reset reconnectCount to 0 when a reconnect delivers a
  frame, so FRAMED_STREAM_MAX_RECONNECTS bounds *consecutive* failures
  (as documented) instead of the lifetime total. Long-lived serverless
  streams that reconnect repeatedly but keep delivering no longer get
  falsely capped. Export the constant for tests.
- tests: add coverage for the max-consecutive-reconnect cap, the
  budget-reset-on-progress regression, and multi-frame-per-read drain.
- world-vercel: temporarily point WORKFLOW_SERVER_URL_OVERRIDE at the
  peter-stream-timeout-error workflow-server preview so this branch's
  e2e exercises the matching server-side stream-timeout behavior. To be
  cleared before merge (see comment).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The consecutive-failure cap resets on forward progress, which is correct
for a backend that honors startIndex. But a World whose readFromStream
ignored startIndex and re-delivered earlier chunks would report progress
on every reconnect, so the consecutive cap would never trip — an
unbounded reconnect loop. Add FRAMED_STREAM_MAX_TOTAL_RECONNECTS (1000),
a hard ceiling that never resets, so the loop always terminates while
staying far above any legitimate long-lived stream. Add a test covering
the pathological ignore-startIndex case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
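The two commit messages above describe a two-tier budget: a consecutive-failure count that resets on forward progress, plus an absolute total that never resets. A minimal sketch of that bookkeeping follows. The class name `ReconnectBudget` and its method names are hypothetical, not the names used in `@workflow/core`, and the caps are passed in rather than hard-coded.

```typescript
// Hypothetical helper illustrating the two-tier reconnect budget.
// Only the idea (consecutive cap + absolute backstop) comes from the PR;
// the class and method names are assumptions made for this sketch.
class ReconnectBudget {
  private consecutive = 0;
  private total = 0;
  private readonly maxConsecutive: number;
  private readonly maxTotal: number;

  constructor(maxConsecutive: number, maxTotal: number) {
    this.maxConsecutive = maxConsecutive;
    this.maxTotal = maxTotal;
  }

  /** Call on a connection error. Returns true if another reconnect is allowed. */
  tryReconnect(): boolean {
    if (this.consecutive >= this.maxConsecutive) return false;
    if (this.total >= this.maxTotal) return false;
    this.consecutive += 1;
    this.total += 1;
    return true;
  }

  /** Call whenever a frame is delivered: only the consecutive counter resets. */
  recordProgress(): void {
    this.consecutive = 0;
  }
}
```

Because `total` is never reset, a backend that re-delivers old chunks and so reports progress on every reconnect still hits the absolute ceiling eventually.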

@VaguelySerious VaguelySerious left a comment


AI review: blocking issues found

Comment thread packages/world-vercel/src/utils.ts Outdated
*/
- const WORKFLOW_SERVER_URL_OVERRIDE = '';
+ const WORKFLOW_SERVER_URL_OVERRIDE =
+   'https://workflow-server-git-peter-stream-timeout-error.vercel.sh';


AI Review: Blocking

The hard-coded server override must be cleared back to '' before this merges — as written it points every consumer's stream traffic at a preview deployment. You've already flagged it as temporary and the two red CI signals are by-design, so this is just the merge gate: don't land until this constant is reset and those checks go green.

// fetch implementations differ on whether cancelling the body
// alone tears down the socket.
return new ReadableStream<Uint8Array>({
start(controller) {


AI Review: Note

The rewrap pumps the upstream eagerly in start(): the loop calls reader.read() → controller.enqueue(value) with no backpressure check, so it drains the upstream as fast as the socket delivers regardless of how fast the downstream consumer reads. The previous code returned response.body directly, which propagates backpressure to the socket. For a fast producer + slow consumer (exactly the long-lived streaming case this path serves), the wrapper can buffer the whole stream in the controller queue → unbounded memory.

Consider a pull-based source instead of an eager start() pump (read one chunk per pull, enqueue, return), or gate the pump on controller.desiredSize. That keeps cancel→abort propagation while preserving backpressure.
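A rough sketch of the pull-based alternative, assuming the wrapper receives the upstream body and the `AbortController` that owns the request. The function name `wrapWithCancel` is hypothetical and this is not the code in `streamer.ts`; it only illustrates reading one chunk per `pull`.

```typescript
// Hypothetical pull-based wrapper: one upstream read per downstream pull,
// so the controller queue never grows beyond its high-water mark.
function wrapWithCancel(
  upstream: ReadableStream<Uint8Array>,
  abortController: AbortController,
): ReadableStream<Uint8Array> {
  const reader = upstream.getReader();
  return new ReadableStream<Uint8Array>({
    // Called only when the downstream queue has room (desiredSize > 0).
    async pull(controller) {
      const result = await reader.read(); // a rejection here errors the wrapper
      if (result.done) {
        controller.close();
        return;
      }
      controller.enqueue(result.value);
    },
    // Downstream cancel aborts the request as well as the body reader,
    // since fetch implementations differ on whether cancelling the body
    // alone tears down the socket.
    async cancel(reason) {
      abortController.abort();
      await reader.cancel(reason).catch(() => {});
    },
  });
}
```

With the default queuing strategy the wrapper holds at most one chunk ahead of the consumer, so a slow reader stalls the upstream reads instead of growing the queue.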

* hard ceiling guarantees the loop always terminates. It is set high enough
* (hours of streaming at realistic per-session timeouts) to never interfere
* with legitimate long-lived streams.
*/


AI Review: Note

With the server now aborting in-progress streams at max-duration (rather than closing cleanly), the negative-startIndex branch becomes a behavior change worth calling out: those reads opt out of reconnect, so a last-N consumer that previously saw a clean EOF at the duration limit will now surface the abort as a hard error. The object-stream consumers that use negative indices (e.g. tail-resolving clients) resolve to an absolute index before connecting, so in practice they shouldn't hit this — but a doc line or a comment noting "negative startIndex + mid-stream server abort = error, not silent close" would save a future debugging session.

@VaguelySerious


(AI) Cross-PR context & merge order

Together these make run.getReadable() transparently reconnect when a stream's underlying connection ends mid-stream (e.g. the periodic server-side max-duration cutoff), instead of surfacing a truncated stream to the consumer.

How the pieces fit:

Behavioural note: the reconnecting reader only reopens on a connection error — a clean close means "complete". So the client change is inert on its own and only takes effect once paired with the coordinated server-side change that ends a timed-out connection with an error rather than a silent close (handled separately). Shipping the client first is therefore safe and must precede that server change.

Suggested order:

  1. [core] Move stream reconnect logic to getReadable level #1847 → stable (client reconnect; a no-op until the paired server change ships).
  2. [core] Forward-port stream reconnect to getReadable level #2318 → main (same, on main). Release both so deployed apps gain reconnect.
  3. The coordinated server-side timeout change — only after the reconnect-capable client is released.

#1853 is independent and can land on its own schedule; it unblocks byte-stream reconnect as a future follow-up.

@TooTallNate TooTallNate left a comment


Approve — with one hard pre-merge gate (the URL override) and two non-blocking design notes

I built @workflow/core + @workflow/world-vercel from this branch and ran the new suites locally: all 10 reconnecting-framed-stream tests and all 20 streamer tests pass.

The design is right

Moving reconnect from the adapter's wire-sniffing control-frame approach (#1790, reverted here) up to the framing layer is the correct factoring. The framing layer is the one place that already knows chunk boundaries, so "count completed frames, resume at startIndex + consumed" needs no wire protocol additions at all — and it works identically for any World adapter, not just world-vercel.

Specifics I verified:

  • Partial-frame discard on reconnect is correct: buffered mid-frame bytes are dropped, currentStartIndex += consumedFrames, and the server resends the in-flight chunk in full. The test at line 140 simulates exactly the production scenario (3 bytes of a frame, then a max-duration abort) and asserts the resume index.
  • The two-tier budget is well-reasoned. Consecutive cap (50) resets on forward progress, so a long-lived stream that reconnects hundreds of times while still delivering is never falsely killed — tested with FRAMED_STREAM_MAX_RECONNECTS + 5 productive reconnects. The absolute backstop (1000) covers the pathological case where a backend ignores startIndex and reports false progress forever — also tested. Both constants exported and documented with the reasoning.
  • Clean EOF = completion, error = reconnect is the right contract, and negative startIndex (last-N) correctly opts out since an absolute resume position can't be computed.
  • Frames pass through with headers intact to the downstream deserializer, which already expects the framed layout — the wrapper only counts, it doesn't re-frame. Nice and minimal.
  • The byte-stream opt-out is correctly scoped (raw streams have no framing to count) and the doc callouts about supportsCancellation for long-lived stream routes are a genuinely useful addition independent of this change.
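The frame-counting and resume-index logic verified above can be sketched roughly as follows. The PR description states that frames carry a 4-byte length prefix; the big-endian byte order and the function names `countCompleteFrames` and `resumeIndex` are assumptions made for this illustration, not taken from the code under review.

```typescript
// Count complete length-prefixed frames in a buffer. Trailing bytes that
// do not form a whole frame are left unconsumed (and would be discarded
// on reconnect, since the server resends the in-flight chunk in full).
// ASSUMPTION: the 4-byte length prefix is big-endian.
function countCompleteFrames(buffer: Uint8Array): {
  frames: number;
  consumedBytes: number;
} {
  const view = new DataView(buffer.buffer, buffer.byteOffset, buffer.byteLength);
  let offset = 0;
  let frames = 0;
  while (offset + 4 <= buffer.byteLength) {
    const payloadLength = view.getUint32(offset, false);
    if (offset + 4 + payloadLength > buffer.byteLength) break; // partial frame
    offset += 4 + payloadLength;
    frames += 1;
  }
  return { frames, consumedBytes: offset };
}

// Compute where to reopen the stream. A negative startIndex means "last N",
// for which no absolute resume position can be computed, so the caller
// should not reconnect.
function resumeIndex(startIndex: number, framesConsumed: number): number | undefined {
  if (startIndex < 0) return undefined;
  return startIndex + framesConsumed;
}
```

For example, a buffer holding two whole frames followed by three bytes of a third yields `frames: 2`, and a stream opened at index 5 would be reopened at index 7.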

Hard gate before merge

WORKFLOW_SERVER_URL_OVERRIDE is pointed at a preview deployment. The in-code comment documents this as temporary and correctly predicts the red CI (the override lint guard + the 4 utils.test.ts override-precedence cases — I checked the failing unit job and those 4 are precisely the failures). This must go back to '' before merge, and CI needs one green re-run after the reset — the current Tests run is also three weeks old (May 29) relative to the branch head, and the MongoDB/Redis community-world results from that run are stale enough that I wouldn't sign off on them either way without a fresh run.

Non-blocking notes

  1. A reconnect-time connection failure is fatal rather than budgeted. The retry budget only covers reader.read() errors. If reconnect() → connect() itself throws (the reopen fetch fails transiently, which is plausible during exactly the kind of server blip that triggers reconnect in the first place), the catch in pull errors the stream immediately with budget remaining. Folding connect failures into the same budgeted loop (count it, retry) would make the wrapper robust against the scenario it exists for. Fine as a follow-up.

  2. The streamer's cancel-propagation wrapper trades away backpressure. The eager pump() loop in readFromStream reads upstream as fast as the network delivers and enqueues without consulting desiredSize, so a slow consumer now buffers the stream in the controller queue instead of letting the socket backpressure naturally (the old code returned response.body directly, which is pull-driven). A pull-based wrapper would keep the AbortController plumbing and preserve backpressure. For typical workflow stream sizes this is unlikely to matter, but it's an unnecessary semantic change for what is otherwise just abort plumbing.

  3. Changeset bump types (patch for both packages on the GA channel) are defensible since this fixes silent truncation, even though it adds new behavior.
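One possible shape for the follow-up suggested in note 1, where a failed reopen is charged against the same budget as a failed read instead of erroring the stream immediately. The `Budget` interface and `reconnectWithinBudget` are hypothetical names for this sketch, and it omits any delay or backoff between attempts, which a real implementation might want.

```typescript
// Hypothetical shared budget: the read loop would decrement the same
// counter when reader.read() fails.
interface Budget {
  remaining: number;
}

// Retry the reopen until it succeeds or the shared budget runs out.
// Each attempt, successful or not, costs one unit of budget.
async function reconnectWithinBudget<T>(
  connect: () => Promise<T>,
  budget: Budget,
): Promise<T> {
  let lastError: unknown = new Error("reconnect budget exhausted");
  while (budget.remaining > 0) {
    budget.remaining -= 1;
    try {
      return await connect();
    } catch (err) {
      lastError = err; // counted against the budget; loop retries
    }
  }
  throw lastError;
}
```

When the budget is exhausted the most recent connect error is rethrown, so the consumer sees the underlying cause rather than a generic failure.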

Once the override is reset and CI is green on a current run, this is good to land. The cross-PR sequencing in the description (this merging and releasing before the coordinated server-side behavior change takes effect) is the right order — until the server change ships, this code path simply never triggers, which makes it safe to release ahead.

Comment thread packages/world-vercel/src/utils.ts Outdated
Comment on lines +63 to +75
*
* NOTE (temporary): this is intentionally pointed at the
* `peter-stream-timeout-error` workflow-server preview so this branch's e2e
* tests exercise the matching server-side stream-timeout behavior. It will be
* cleared back to '' once those server-side changes merge — not a review
* concern. While it is set, two CI signals are red by design and will go
* green again on reset: the "WORKFLOW_SERVER_URL_OVERRIDE is empty" lint
* guard, and the override-precedence cases in `utils.test.ts` (the hardcoded
* value intentionally wins over the env var, which those cases assert is
* absent).
*/
- const WORKFLOW_SERVER_URL_OVERRIDE = '';
+ const WORKFLOW_SERVER_URL_OVERRIDE =
+   'https://workflow-server-git-peter-stream-timeout-error.vercel.sh';


Suggested change
*
* NOTE (temporary): this is intentionally pointed at the
* `peter-stream-timeout-error` workflow-server preview so this branch's e2e
* tests exercise the matching server-side stream-timeout behavior. It will be
* cleared back to '' once those server-side changes merge — not a review
* concern. While it is set, two CI signals are red by design and will go
* green again on reset: the "WORKFLOW_SERVER_URL_OVERRIDE is empty" lint
* guard, and the override-precedence cases in `utils.test.ts` (the hardcoded
* value intentionally wins over the env var, which those cases assert is
* absent).
*/
const WORKFLOW_SERVER_URL_OVERRIDE = '';
const WORKFLOW_SERVER_URL_OVERRIDE =
'https://workflow-server-git-peter-stream-timeout-error.vercel.sh';
*/
const WORKFLOW_SERVER_URL_OVERRIDE = '';

Reverts the temporary preview override (and its NOTE) so utils.ts has no
diff. The matching server-side stream-timeout behavior is validated via
its own PR; the SDK override must stay empty (lint guard enforces it).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions github-actions Bot mentioned this pull request Jun 11, 2026